Bivariate Analysis¶

Importing Required Libraries¶

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

Loading the dataset¶

In [4]:
data = pd.read_csv(r"C:\Users\DEEL\OneDrive\Documents\Clean_course_rec2.csv")
data
Out[4]:
job_id job_title category course_title skills
0 3900085113 human resources manager bilingual spanish hr beginner to pro in powerpoint complete powerpo... ai, c, erp, powerpoint, r, training
1 3900085113 human resources manager bilingual spanish hr how to create amazing cinemagraphs with micros... ap, c, erp, microsoft, microsoft powerpoint, p...
2 3900085113 human resources manager bilingual spanish hr logo design in powerpoint design, erp, powerpoint, r
3 3900085113 human resources manager bilingual spanish hr business card design in powerpoint ar, c, ca, design, erp, powerpoint, r
4 3900085113 human resources manager bilingual spanish hr flat icon design in powerpoint c, design, erp, powerpoint, r
... ... ... ... ... ...
4303 3899516898 business development manager business development create kindle ebook covers with powerpoint c, erp, powerpoint, r
4304 3899516898 business development manager business development plantillas powerpoint para publicar en mercado... ar, c, ca, erp, lan, powerpoint, r
4305 3899516898 business development manager business development basic graphic design for powerpoint ap, c, design, erp, graphic design, phi, power...
4306 3899516898 business development manager business development how to design professional powerpoint business... design, erp, powerpoint, presentation, r
4307 3899516898 business development manager business development self advertise using powerpoint twitter and fa... c, erp, fa, ios, powerpoint, r

4308 rows × 5 columns

Checking for Unique Values¶

In [5]:
data.nunique()
Out[5]:
job_id            27
job_title         24
category           5
course_title    1946
skills           680
dtype: int64

Vectorizing the Skills Feature, Creating a Cosine Similarity Matrix, and Plotting a Heatmap of Job Title Similarity Based on Skills¶

This section measures how similar different job titles are based on the skills they require.
It uses TF-IDF to convert skills into numeric vectors and cosine similarity to measure how close the jobs are to each other.
Finally, a heatmap visualizes the results.


Step-by-Step Explanation¶

  1. Grouping skills by job title

    • Collects all skills related to each job title.
    • Joins multiple skills into a single string so every job has one combined skill set.
  2. Converting skills into numerical features (TF-IDF)

    • TF-IDF assigns weights to each skill:
      • Skills that appear across many jobs get lower weight.
      • Skills that appear in only a few jobs get higher weight.
    • This helps highlight distinctive skills for each job.
  3. Measuring similarity between job titles

    • Cosine similarity is applied to the TF-IDF values.
    • Produces a similarity score between 0 and 1:
      • 1 → very similar job roles.
      • 0 → completely different skills.
  4. Creating a similarity matrix

    • Builds a table where both rows and columns represent job titles.
    • Each cell contains the similarity score between two job titles.
  5. Visualizing with a heatmap

    • Displays the similarity matrix as a colored grid.
    • Brighter colors = higher similarity.
    • Darker colors = lower similarity.
    • Makes it easy to spot clusters of jobs with overlapping skills.
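The steps above can be sketched on a toy example using only the standard library, so the arithmetic stays visible. The three job titles and their skills are made up for illustration. The weighting follows the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. Unlike the default TfidfVectorizer, this sketch treats each comma-separated skill as one token, so it is an illustration of the idea rather than a reimplementation of the library.

```python
import math
from collections import Counter

# Hypothetical job titles with their combined skill strings.
job_skills = {
    "data analyst": "excel, sql, python",
    "data engineer": "sql, python, spark",
    "graphic designer": "canva, photoshop",
}

def tokenize(text):
    # One token per comma-separated skill (an assumption of this sketch).
    return [s.strip() for s in text.split(",") if s.strip()]

docs = {job: Counter(tokenize(text)) for job, text in job_skills.items()}
n_docs = len(docs)

# Document frequency: in how many job titles each skill appears.
df = Counter(skill for counts in docs.values() for skill in counts)

def tfidf_vector(counts):
    # Term count times smoothed IDF, then L2 normalization to unit length.
    weights = {s: tf * (math.log((1 + n_docs) / (1 + df[s])) + 1)
               for s, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {s: w / norm for s, w in weights.items()}

vectors = {job: tfidf_vector(counts) for job, counts in docs.items()}

def cosine(a, b):
    # Both vectors have unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(s, 0.0) for s, w in a.items())

# Print the similarity matrix row by row.
for job_a in vectors:
    row = [round(cosine(vectors[job_a], vectors[job_b]), 2) for job_b in vectors]
    print(job_a, row)
```

In this toy data the shared skills (sql, python) pull the two data roles together, while the designer shares no skill with either and therefore scores 0 against both.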
In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Combine all skill strings for each job title into one string per title
job_skills = data.groupby('job_title')['skills'].apply(lambda x: ', '.join(x)).reset_index()

# Convert the combined skill strings into TF-IDF vectors (one row per job title)
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(job_skills['skills'])

# Pairwise cosine similarity between job titles, labeled by title
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
similarity_df = pd.DataFrame(cosine_sim, index=job_skills['job_title'], columns=job_skills['job_title'])

# Draw the similarity matrix as a heatmap
plt.figure(figsize=(12, 10))
sb.heatmap(similarity_df, cmap='viridis')
plt.title('Job Title Similarity based on Skills')
plt.xlabel('Job Title')
plt.ylabel('Job Title')
plt.show()
[Figure: heatmap "Job Title Similarity based on Skills", with job titles on both axes]
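One detail worth checking when reading this heatmap: the documented default `token_pattern` of TfidfVectorizer is `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters and splits on spaces and punctuation. The sketch below applies that same regular expression with Python's `re` module to a made-up skills string in the dataset's style. It mimics only the tokenization step, not the full vectorizer.

```python
import re

# scikit-learn's documented default token_pattern for TfidfVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical skills string, written in the style of the dataset.
skills = "ap, c, design, erp, graphic design, r"

# Word-level tokens, as the default pattern would extract them.
tokens = re.findall(DEFAULT_TOKEN_PATTERN, skills.lower())
print(tokens)
# Single-character skills ("c", "r") are dropped, and "graphic design"
# is split into two separate tokens.

# Splitting on commas instead keeps each skill intact.
whole_skills = [s.strip() for s in skills.split(",")]
print(whole_skills)
```

If whole skills should be the unit of comparison, passing a custom `tokenizer` that splits on commas to TfidfVectorizer is one option.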

Box Plot of the Number of Skills Required in Each Category¶

This code calculates how many skills each job listing has and then visualizes the distribution of skill counts across job categories using a box plot.


Step-by-Step Explanation¶

  1. Counting the number of skills per job

    • Each job’s skills are stored as a comma-separated list.
    • The number of commas is counted, and +1 is added (since the number of items = commas + 1).
    • A new column skill_count is created to store this value for each job.
  2. Creating a box plot

    • The x-axis represents different job categories.
    • The y-axis represents the number of skills required for jobs in that category.
    • Each box shows:
      • Median line → the typical number of skills required.
      • Box edges (Q1 and Q3) → the middle 50% of skill counts.
      • Whiskers → the range of the data points that lie within 1.5 × IQR of the box edges.
      • Outliers → points beyond the whiskers, i.e. jobs that require unusually high or low numbers of skills.
  3. Improving visualization

    • The plot size is increased for clarity.
    • Category names on the x-axis are rotated to prevent overlap.
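The counting rule from step 1 can be checked on a few strings. The sketch below uses plain Python and made-up skill strings. It also shows two cases where "commas + 1" miscounts (an empty string and a trailing comma), which a split-based count handles. The helper names are hypothetical.

```python
def count_by_commas(skills):
    # Mirrors the notebook's rule: number of items = number of commas + 1.
    return skills.count(",") + 1

def count_by_split(skills):
    # Counts only the non-empty items after splitting on commas.
    return len([s for s in skills.split(",") if s.strip()])

examples = [
    "design, erp, powerpoint, r",  # regular list
    "powerpoint",                  # single skill, no comma
    "",                            # empty string
    "c, erp,",                     # trailing comma
]

# Show both counts side by side for each example string.
for skills in examples:
    print(repr(skills), count_by_commas(skills), count_by_split(skills))
```

For well-formed strings both counts agree, so the comma rule is adequate as long as the skills column has no empty entries or stray separators.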
In [7]:
data['skill_count'] = data['skills'].str.count(',') + 1

# Plotting the box plot
plt.figure(figsize=(12, 7))
sb.boxplot(x='category', y='skill_count', data=data)
plt.title('Distribution of Skill Count per Job Category')
plt.xlabel('Job Category')
plt.ylabel('Number of Skills')
plt.xticks(rotation=45) # Rotate category names if they overlap
plt.show()
[Figure: box plot "Distribution of Skill Count per Job Category", with Job Category on the x-axis and Number of Skills on the y-axis]

Validating the Outliers in the Finance Category Courses¶


Step-by-Step Explanation¶

  1. Filter the Finance category

    • Select only rows where the job category is finance.
    • This narrows down the dataset to just finance-related courses/jobs.
  2. Calculate Q1, Q3, and IQR

    • Q1 (25th percentile): The value below which 25% of the data falls.
    • Q3 (75th percentile): The value below which 75% of the data falls.
    • IQR (Interquartile Range): The difference Q3 - Q1, showing the middle spread of the data.
  3. Determine the outlier threshold

    • Outliers are defined as values greater than Q3 + 1.5 * IQR.
    • This is a standard statistical rule for identifying unusually high data points.
  4. Filter out the outliers

    • Identify all finance jobs/courses where the skill count is above this threshold.
    • These rows are potential outliers.
  5. Display the results

    • Show relevant details (course_title, skills, skill_count) for the outlier rows.
    • Sorting them by skill_count helps in examining the most extreme cases.
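The IQR rule in steps 2 to 4 can be worked through by hand on a short list. The skill counts below are hypothetical, chosen so that the quartiles come out to 3 and 5. The sketch uses `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points in the same way as the pandas `quantile` default. Note that only the upper threshold is checked, as in the notebook.

```python
import statistics

# Hypothetical skill counts, sorted for readability.
skill_counts = [2, 3, 3, 4, 4, 5, 5, 9, 14]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(skill_counts, n=4, method="inclusive")
iqr = q3 - q1

# Upper outlier threshold from the 1.5 * IQR rule.
upper_threshold = q3 + 1.5 * iqr

# Values strictly above the threshold are flagged as outliers.
outliers = [c for c in skill_counts if c > upper_threshold]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, threshold={upper_threshold}")
print("Outliers:", outliers)
```

With Q1 = 3 and Q3 = 5 the threshold is 5 + 1.5 × 2 = 8, so only the counts 9 and 14 are flagged.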

Why We Are Doing This¶

  • The box plot revealed that the Finance category has many outliers in terms of required skills.

  • We wanted to check if these outliers were due to:

    • Repeated skills being listed multiple times, or
    • Legitimately high skill requirements.
  • After inspection, we found that the skills listed were legitimate, meaning that finance courses/jobs often demand significantly more skills than typical categories.


Purpose of the Code¶

  • To validate whether outliers in the Finance category are data quality issues or true skill requirements.
  • Helps confirm that the Finance sector legitimately requires a broader and deeper set of skills compared to other categories.
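One way to test the "repeated skills" hypothesis mentioned above is to look for duplicates inside each skills string. The helper below is a hypothetical sketch in plain Python with made-up rows. It is not necessarily how the inspection in this notebook was carried out.

```python
def duplicate_skills(skills):
    # Return the skills that appear more than once in a comma-separated string,
    # ignoring case and surrounding whitespace.
    seen, repeated = set(), []
    for skill in (s.strip().lower() for s in skills.split(",")):
        if skill in seen and skill not in repeated:
            repeated.append(skill)
        seen.add(skill)
    return repeated

# Made-up rows: one clean, one with repeated entries.
rows = {
    "clean row": "accounting, banking, blockchain",
    "repeated row": "accounting, banking, Accounting, banking",
}

for label, skills in rows.items():
    print(label, duplicate_skills(skills))
```

An empty list for every outlier row would support the conclusion that the high counts are not caused by duplicated entries.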
In [13]:
# 1. Keep only the rows in the finance category
finance_data = data[data['category'] == 'finance']

# 2. Quartiles and interquartile range of the skill counts
Q1 = finance_data['skill_count'].quantile(0.25)
Q3 = finance_data['skill_count'].quantile(0.75)
IQR = Q3 - Q1

# 3. Upper outlier threshold (1.5 * IQR rule)
outlier_threshold = Q3 + 1.5 * IQR

print(f"Finance Category Q1: {Q1}")
print(f"Finance Category Q3: {Q3}")
print(f"Finance Category IQR: {IQR}")
print(f"Anything above {outlier_threshold:.2f} skills is an outlier.")

# 4. Filter the DataFrame to find the outliers
finance_outliers = finance_data[finance_data['skill_count'] > outlier_threshold]

# 5. Display the relevant columns for the outlier rows, sorted by skill count
print("\n--- Outliers in the Finance Category ---")
finance_outliers[['course_title', 'skills', 'skill_count']].sort_values(by='skill_count')
Finance Category Q1: 3.0
Finance Category Q3: 5.0
Finance Category IQR: 2.0
Anything above 8.00 skills is an outlier.

--- Outliers in the Finance Category ---
Out[13]:
course_title skills skill_count
2429 build enterprise applications with angular 2 a... ap, applications, ar, c, ca, enterprise applic... 9
2910 how to make it work successfully in capital ma... ap, api, ar, c, ca, capital markets, make, r, sf 9
2922 canva graphics design essential training for e... ai, ap, c, ca, canva, design, phi, r, training 9
2939 canva graphic design theory volume1 ap, c, ca, canva, design, graphic design, heor... 9
3039 graphic design double your sales with canva ap, c, ca, canva, design, graphic design, phi,... 9
3119 canva graphic design theory volume2 ap, c, ca, canva, design, graphic design, heor... 9
3682 build enterprise applications with angular 2 a... ap, applications, ar, c, ca, enterprise applic... 9
4179 learn facebook flux architecture for web appli... ap, applications, ar, c, ca, fa, pp, r, web ap... 9
3730 ruby on rails training and skills to build web... ai, ap, applications, c, ca, pp, r, training, ... 9
3769 start web development with gis map in javascript ap, ar, c, development, java, javascript, pm, ... 9
3806 master electron desktop apps using html, javas... ap, c, css, html, java, javascript, ml, pp, r 9
3864 web application development using redis, expre... ap, application development, c, ca, developmen... 9
3987 html5 and css3 learn web design with html css ... ap, ar, c, css, design, html, ml, r, web design 9
4132 javascript promises applications in es6 and an... ap, applications, ar, c, ca, java, javascript,... 9
4175 web application development learn by building ... ap, application development, ar, c, ca, develo... 9
3740 servlets and jsps tutorial learn web applicati... ap, applications, ar, c, ca, java, pp, r, web ... 9
2780 sensitivity scenario analysis for ca cfa cpa e... ar, c, ca, cfa, cpa, fa, r, scenario analysis,... 9
2800 financial management capital market instruments ap, api, ar, c, ca, cia, financial management,... 9
2530 web application development learn by building ... ap, application development, ar, c, ca, develo... 9
2441 ruby on rails training and skills to build web... ai, ap, applications, c, ca, pp, r, training, ... 9
2443 servlets and jsps tutorial learn web applicati... ap, applications, ar, c, ca, java, pp, r, web ... 9
2456 master electron desktop apps using html, javas... ap, c, css, html, java, javascript, ml, pp, r 9
2464 web application development using redis, expre... ap, application development, c, ca, developmen... 9
2520 javascript promises applications in es6 and an... ap, applications, ar, c, ca, java, javascript,... 9
2532 learn facebook flux architecture for web appli... ap, applications, ar, c, ca, fa, pp, r, web ap... 9
4198 learn web development by creating a social net... ar, c, cia, development, net, network, pm, r, ... 9
3856 build your own calculator app with javascript,... ap, c, ca, css, html, java, javascript, ml, pp, r 10
4176 rails ecommerce app with html template from th... ai, ap, c, ecommerce, html, mef, ml, pp, r, rest 10
2653 risk analysis capital budgeting for ca cs cfa ... ap, api, budgeting, c, ca, capital budgeting, ... 10
2531 rails ecommerce app with html template from th... ai, ap, c, ecommerce, html, mef, ml, pp, r, rest 10
2543 fintech and the transformation in financial se... banking, blockchain, business transformation, ... 10
4049 learn html css how to start your web developme... ar, c, ca, css, development, html, ml, pm, r, ... 10
3784 web development with html css bootstrap jquery... ap, c, css, development, html, jquery, ml, pm,... 10
2833 school of raising capital agile financial mode... agile, ai, ap, api, c, ca, cia, financial mode... 10
2463 build your own calculator app with javascript,... ap, c, ca, css, html, java, javascript, ml, pp, r 10
2652 working capital management for ca cfa cpa exams ap, api, c, ca, capital management, cfa, cpa, ... 11
2544 supply chain finance and blockchain technology accounts payable and receivable, blockchain, c... 14
2539 intuit academy bookkeeping accounting, accounting software, accounts paya... 14
2542 digital transformation in financial services banking, blockchain, business analysis, busine... 18
2538 introduction to finance and accounting account management, accounting, accounts payab... 21
2541 financial management account management, accounting, accounts payab... 23
2540 business and financial modeling accounting, basic descriptive statistics, busi... 28
2537 business foundations account management, accounting, accounts payab... 36

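Plotting a Scatter Plot to Analyze the Relationship Between Skills and Courses¶

This code represents each course by the similarity of its skills to every other course, reduces that representation to two dimensions with Principal Component Analysis (PCA), and shows the result in an interactive Plotly scatter plot. No clustering algorithm is run; groups of related courses appear as visual clusters of nearby points.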


Step-by-Step Explanation¶

  1. Group skills by course

    • Combines all skills listed under each course_title into a single string.
    • Ensures that each course has one consolidated skill set to analyze.
  2. Convert skills into numerical features (TF-IDF)

    • TfidfVectorizer converts skill text into numerical vectors.
    • Assigns higher weights to unique skills and lower weights to common ones.
    • This allows us to represent each course in terms of distinctive skills.
  3. Compute similarity between courses

    • cosine_similarity measures how close two courses are based on skills.
    • Produces a similarity matrix where values range from:
      • 1 → very similar courses (skills overlap a lot).
      • 0 → very different courses (skills do not overlap).
  4. Dimensionality reduction using PCA

    • Cosine similarity produces a high-dimensional matrix.
    • PCA reduces this data to 2 dimensions (PCA1, PCA2) for visualization.
    • Courses with similar skills will be positioned closer together in this reduced space.
  5. Create a DataFrame for visualization

    • Stores the PCA results in a DataFrame.
    • Adds the corresponding course titles for reference.
  6. Visualize with an interactive scatter plot (Plotly)

    • Each point represents a course.
    • Courses closer to each other indicate similar skill sets.
    • Hovering over points shows the course title.
    • Makes it easy to explore clusters and identify related courses interactively.
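To make step 4 concrete, the sketch below computes only the first principal component of a tiny, made-up similarity matrix in plain Python. It uses power iteration on the covariance matrix, which is a stand-in chosen for readability; scikit-learn's PCA relies on a singular value decomposition instead, and the notebook keeps two components rather than one.

```python
import math

# Hypothetical 4x4 cosine-similarity matrix: courses 0-1 and 2-3 form two groups.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
n = len(sim)

# Center each column, as PCA does before finding directions of maximum variance.
means = [sum(row[j] for row in sim) / n for j in range(n)]
X = [[row[j] - means[j] for j in range(n)] for row in sim]

# Sample covariance matrix of the columns.
cov = [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1) for j in range(n)]
       for i in range(n)]

# Power iteration: repeatedly multiply by cov and renormalize to approach
# the eigenvector with the largest eigenvalue (the first principal axis).
v = [1.0] + [0.0] * (n - 1)
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# Coordinate of each course along the first principal axis.
pc1 = [sum(X[k][j] * v[j] for j in range(n)) for k in range(n)]
print([round(p, 3) for p in pc1])
```

Courses from the same group land on the same side of zero along this axis, which is the effect the scatter plot relies on when similar courses appear close together.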

Purpose of the Code¶

  • To map courses into two dimensions based on their skills and visualize how they group together.
  • Helps answer questions like:
    • Which courses have overlapping skill requirements?
    • Are there clear clusters of courses focusing on similar topics?
    • Can we spot outlier courses that require very unique skills?
In [10]:
!pip install plotly
import plotly.express as px
from sklearn.decomposition import PCA

# The skills column holds plain strings, so the values can be joined directly
course_skills = data.groupby('course_title')['skills'].apply(lambda x: ', '.join(x)).reset_index()


vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(course_skills['skills'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)


pca = PCA(n_components=2)
pca_result = pca.fit_transform(cosine_sim)

 
pca_df = pd.DataFrame(pca_result, columns=['PCA1', 'PCA2'])
pca_df['course_title'] = course_skills['course_title']


fig = px.scatter(pca_df, x='PCA1', y='PCA2', 
                hover_data=['course_title'], 
                title='Course Clustering by Skills')
fig.show()
Requirement already satisfied: plotly in e:\ana\new folder\lib\site-packages (5.24.1)
Requirement already satisfied: tenacity>=6.2.0 in e:\ana\new folder\lib\site-packages (from plotly) (8.2.3)
Requirement already satisfied: packaging in e:\ana\new folder\lib\site-packages (from plotly) (24.1)